The purpose of Auto Discover is to find interesting correlations and patterns in a database using a variety of machine learning tools and algorithms, such as partitioning, forecasting, clustering, correlation tests and trend detection. When Auto Discover is run, it produces a series of data discoveries and presents them in a Presentation.
The data discoveries produced can be grouped into three main categories:
- Line charts: based on forecasting algorithms
- Scatter and bubble charts: based on partitioning algorithms
- Other chart types: based on trends
Dimension Reduction and Ranking
Because querying every possibility (every column in the database) is infeasible, a pre-processing step of ranking and dimension reduction is applied first.
Given sample data from each column, data may be removed from the analysis based on simple distribution tests, noise tests and other heuristics.
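As a rough illustration of this kind of per-column screening, the sketch below drops a sampled column when a single value dominates it or when almost every value is distinct. The function name `keep_column`, both thresholds and the minimum sample size are assumptions made for the example; the actual tests and cut-offs used by Auto Discover are not documented here.

```python
from collections import Counter

def keep_column(values, max_dominant_share=0.95, max_unique_share=0.95):
    """Return True if a sampled categorical-style column looks worth analyzing.

    Both thresholds are illustrative assumptions, not product values.
    """
    present = [v for v in values if v is not None]
    if not present:
        return False  # the sample is entirely empty
    counts = Counter(present)
    # Distribution check: a column dominated by one value carries little signal.
    dominant_share = counts.most_common(1)[0][1] / len(present)
    if dominant_share >= max_dominant_share:
        return False
    # Noise check: a column where nearly every value is distinct (an ID, say)
    # cannot be grouped meaningfully. Only applied to samples above 20 rows.
    unique_share = len(counts) / len(present)
    if len(present) > 20 and unique_share >= max_unique_share:
        return False
    return True

sample = {
    "country": ["US", "UK", "DE", "US", "FR"] * 10,
    "constant_flag": ["Y"] * 50,
    "order_id": list(range(50)),
}
kept = [name for name, values in sample.items() if keep_column(values)]
print(kept)
```

Here only `country` survives: `constant_flag` has a single value and `order_id` is unique on every row.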
Given sample data from several joined tables (a dataset), Auto Discover runs multiple Random Forests (an algorithm that creates an ensemble of decision trees, or a decision forest), each predicting a randomly chosen column. The information gain extracted from these models is used as follows:
- The average information gain of each column is used as its rank
- Columns with a low average information gain are removed from the analysis
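The ranking idea can be sketched in a few lines. Note that this is a simplified stand-in: rather than training Random Forests, it computes the entropy-based information gain of each column directly against randomly chosen target columns, then averages the result. The names `rank_columns`, `n_targets` and `keep` are hypothetical, and the data is made up for the example.

```python
import math
import random
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature, target):
    """Reduction in the target's entropy after grouping rows by the feature."""
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    remainder = sum(len(g) / len(target) * entropy(g) for g in groups.values())
    return entropy(target) - remainder

def rank_columns(dataset, n_targets=2, keep=2, seed=0):
    """Average each column's gain over random targets and keep the top `keep`."""
    rng = random.Random(seed)
    names = list(dataset)
    gains = defaultdict(list)
    # Sampling without replacement: with n_targets >= 2, every column is
    # scored as a feature in at least one round.
    for target in rng.sample(names, n_targets):
        for name in names:
            if name != target:
                gains[name].append(information_gain(dataset[name], dataset[target]))
    average = {name: sum(g) / len(g) for name, g in gains.items()}
    ranked = sorted(average, key=average.get, reverse=True)
    return ranked[:keep], average

dataset = {
    "region": ["N", "N", "S", "S"] * 4,
    "channel": ["web", "web", "store", "store"] * 4,  # fully determined by region
    "noise": ["a", "b"] * 8,                          # unrelated to the others
}
kept, average = rank_columns(dataset, n_targets=3, keep=2)
print(kept, average)
```

With every column used once as a target, `region` and `channel` explain each other and rank highest, while `noise` has zero average gain and is dropped.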
Visualization of Auto-Discoveries
Line Charts
Line charts in Auto Discover are based on forecasting algorithms, which can only operate if time series data was added during the ETL.
Given the ranking and the time hierarchy, ARIMA and Holt-Winters forecasts are estimated (by splitting the data). The forecast with Mean Absolute Percentage Error (MAPE) above a hard-coded threshold will be presented in a line chart. Note that forecasting will produce a maximum of four graphs.
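To make the "splitting the data" and MAPE steps concrete, the sketch below holds out the end of a series, forecasts it, and scores the forecast. The seasonal-naive forecaster is only a stand-in for ARIMA or Holt-Winters, the series is invented, and the threshold comparison is left out because the actual threshold value is not documented here.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def seasonal_naive(train, horizon, season_length):
    """Stand-in forecaster: repeat the last observed season over the horizon."""
    last_season = train[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]

# Three years of quarterly values (illustrative data only).
series = [100, 120, 140, 110, 104, 125, 146, 114, 108, 130, 152, 119]

# Split: fit on the first two years, evaluate on the held-out final year.
train, test = series[:8], series[8:]
forecast = seasonal_naive(train, len(test), season_length=4)
error = mape(test, forecast)
print(forecast, round(error, 2))
```

The same `mape` function can score any forecaster that returns one value per held-out period, so competing models can be compared on the same split.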
Scatter and Bubble Charts
Scatter and bubble charts are the result of either the best partition algorithm, or the correlation (Pearson) test. The x-axis, y-axis, and size are picked randomly.
Given the ranking, the best candidates are tested for partitioning using a multi-class Support Vector Machine (SVM). The candidate with the highest learning score determines the query used to construct the partition graphs.
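The correlation side of this selection can be illustrated with a plain Pearson coefficient computed over every pair of numeric columns. This is a minimal sketch on made-up data; the column names are hypothetical, and the SVM partition scoring is not shown.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

columns = {
    "units": [1, 2, 3, 4, 5],
    "revenue": [12, 19, 33, 41, 48],
    "discount": [5, 1, 4, 2, 3],
}

# Score every pair and keep the one with the strongest linear relationship.
scores = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
best_pair = max(scores, key=lambda pair: abs(scores[pair]))
print(best_pair, round(scores[best_pair], 3))
```

The absolute value is used so that strong negative correlations are treated as equally interesting candidates for a scatter chart.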
Other Charts
Tables, column charts and bar charts are based on statistical tests (the Pearson test and the chi-square test), which determine the presence of a correlation and a change in trend. Where a distinct change in trend occurs, a chart will be presented.
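For categorical data, the chi-square test compares observed counts with the counts expected if the two attributes were independent. The sketch below computes the statistic for a small contingency table; the table values are invented, and how Auto Discover turns the result into a chart decision is not shown.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed counts. Assumes every row and
    column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Rows: two customer segments; columns: purchased / did not purchase.
observed = [[30, 10],
            [10, 30]]
stat, dof = chi_square(observed)

# 3.841 is the standard critical value for 1 degree of freedom at the 5% level.
related = stat > 3.841
print(stat, dof, related)
```

A statistic well above the critical value, as in this table, indicates that the two attributes are unlikely to be independent.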
Charts are drawn using Pyramid's auto-visualization tool, which uses a set of augmented AI routines to determine the best visualization for a given data set and, within that visualization, the best way to plot the different elements. The chart selection depends on the amount of data represented (the number of points and attributes), the type of data presented and the best way to portray the different chart types.